Lab 3.1 - Bootstrapping
Creating a bootstrap simulation
For this assignment, we will be using data from the Washington D.C. bicycle sharing system. In particular, we want to try and create accurate confidence intervals for various features of the dataset.
You can find the variable definitions here.
Expectations
- Before starting, write a sentence or two on what you expect the confidence intervals to look like for two variables: cnt (the number of bicycles rented that day) and temp.
Setup
Create a sample
- Make a subset of the data that is a random sample of size 50 using the following command:
dcbikes_sample <- dcbikes %>%
  slice_sample(n = 50, replace = TRUE)
Classical confidence intervals
Check the conditions, then compute the 95% confidence intervals by hand for the two variables.
Compare these results to the overall dataset values - did your confidence interval cover the true value? How close was your sample to reality?
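As a reminder of the arithmetic behind a "by hand" interval, here is a minimal sketch that assumes a t-based interval for a mean. The vector x is made-up data standing in for one numeric column of your size-50 sample; it is not the bike data, and the names used are only for illustration.

```r
# Made-up stand-in for one numeric column of a size-50 sample
set.seed(1)
x <- rnorm(50, mean = 4500, sd = 1900)

n      <- length(x)
x_bar  <- mean(x)                  # sample mean
se     <- sd(x) / sqrt(n)          # standard error of the mean
t_crit <- qt(0.975, df = n - 1)    # critical value for 95% confidence

# 95% interval: mean plus or minus critical value times standard error
ci_classical <- c(lower = x_bar - t_crit * se,
                  upper = x_bar + t_crit * se)
ci_classical
```

You can check your hand calculation against t.test(x)$conf.int, which should report the same two endpoints.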
Bootstrapping
- Use the rep_sample_n(size = 50, replace = TRUE, reps=100000) function to resample with replacement. Make sure to follow the code example on the moderndive webpage to generate your sampling distribution of means.
Note: VERY IMPORTANT - use your sample of size 50, and then resample from THAT sample.
To create the replications, you can modify the following sample code:
virtual_resamples <- pennies_sample %>%
  rep_sample_n(size = 50, replace = TRUE, reps = 10000)
To generate a list of sample means, in particular, you’ll need to modify this sample code to correspond to your data:
virtual_resampled_means <- virtual_resamples %>%
  group_by(replicate) %>%
  summarize(mean_year = mean(year))
Remember that when bootstrapping, you should make sure that the size argument is set to the size of your sample - you want to use all of the information possible from your sample!
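If it helps to see what the resampling step is doing under the hood, here is a rough base-R sketch of the same idea. It is not the moderndive workflow: the vector x is made-up data, and reps is kept small so the example runs quickly.

```r
# Made-up stand-in for one numeric column of a size-50 sample
set.seed(2)
x <- rnorm(50, mean = 0.5, sd = 0.18)

reps <- 1000  # number of bootstrap resamples (small, for illustration only)

# Each replicate: draw length(x) values from x WITH replacement, then take the mean
boot_means <- replicate(
  reps,
  mean(sample(x, size = length(x), replace = TRUE))
)

length(boot_means)  # one mean per resample
```

Each resample is the same size as the original sample, which is why the size argument matters: a smaller size would make the resampled means more spread out than they should be.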
- Create a confidence interval using the quantile() function on this sampling distribution. How do these compare to your confidence intervals calculated in the classical way?
In particular, you want the cutoffs at the 0.025 and 0.975 quantiles of your resampled means (the range within which the middle 95% of the sample means fell).
Hint: you can find documentation on using the quantile() function by typing ?quantile in the Console window.
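To illustrate how quantile() turns a vector of resampled means into an interval, here is a small self-contained sketch. The data are made up and the resampling is done in base R; in the lab you would pass your own column of resampled means instead of boot_means.

```r
# Made-up stand-in for one numeric column of a size-50 sample
set.seed(3)
x <- rnorm(50, mean = 4500, sd = 1900)

# Bootstrap means: resample x with replacement, same size as x, 2000 times
boot_means <- replicate(2000, mean(sample(x, size = length(x), replace = TRUE)))

# Percentile interval: the cutoffs that bracket the middle 95% of the means
ci_boot <- quantile(boot_means, probs = c(0.025, 0.975))
ci_boot
```

The result is a named vector whose two entries are the lower and upper ends of the 95% percentile interval.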
Compare the result of this confidence interval you generated by bootstrapping to the ones calculated by classical methods. How close were they? Do the differences surprise you or not?
Think carefully about what the difference is between a confidence interval calculated by classical methods and the one generated by bootstrapping. What are the differences in key assumptions?
If you have extra time, you can try to use the alternative workflow described here.
If you still have extra time, you can try to bootstrap regression lines via the method described here